Method and apparatus for multiply instructions in data processors

ABSTRACT

The disclosed embodiments relate to apparatus for accurately, efficiently and quickly executing a multiplication instruction. The disclosed embodiments can provide a multiplier module having an optimized layout that can help speed up computation of a result during a multiply operation so that cycle delay can be reduced and so that power consumption can be reduced.

TECHNICAL FIELD

Embodiments of the subject matter described herein relate generally to data processors that execute instructions. More particularly, embodiments of the subject matter relate to a multiplier module for executing a multiply instruction.

BACKGROUND

A processing core can include multiple data processors that execute program instructions by performing various arithmetic operations. Examples of arithmetic operations that can be performed by a data processor (such as, for example, a CPU, a GPU, or a combined CPU and GPU, often referred to as an accelerated processing unit (APU), and the like) include addition, multiplication, division, and the like. In addition, some processors can support more complex operations. One example is a multiply-and-accumulate (MAC) operation that computes the product of two numbers and adds that product to another number.

One conventional type of multiplier module includes booth encoder circuitry that is used to process a multiplier operand and generate control signals that are provided to corresponding booth multiplexers. The booth multiplexers use these control signals to process the multiplicand operand and generate partial products that are then provided to a compression tree. The compression tree includes a carry-save adder (CSA) array and a carry-save adder (CSA) coupled to the CSA array. The CSA array has inputs configured to receive the partial products, and includes a number of carry-save adders (CSAs) implemented at different compressor levels for compressing the partial products to generate a sum and carry output.

The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.

BRIEF SUMMARY OF EMBODIMENTS

In accordance with some of the disclosed embodiments, a processor is provided that is configured to process a multiplicand operand and a multiplier operand. The processor includes a partial product array having a folded layout that is split into a low-side and a high-side. The partial product array includes a partial product generator and a partial product reducer.

The partial product generator includes a plurality of booth encoders and a plurality of booth multiplexers. The booth encoders are arranged along a substantially diagonal path that extends between the high-side and the low-side. Each of the booth encoders is configured to perform booth encoding on a portion of the multiplier operand to generate a particular select signal. Each of the booth multiplexers is configured to receive a particular version of the multiplicand operand, and one of the particular select signals generated by one of the booth encoders, and to generate a partial product such that the plurality of booth multiplexers collectively generate a plurality of partial products.

The partial product reducer is configured to receive and reduce the plurality of partial products to generate result outputs. The partial product reducer includes a carry-save adder array. The carry-save adder array includes a plurality of compressor levels that are interleaved with the plurality of booth multiplexers.

In accordance with one particular implementation of the disclosed embodiments, a multiplier module is provided for digitally multiplying a multiplicand operand by a multiplier operand. The multiplier module includes a partial product array having a folded layout that is split into a low-side and a high-side. The partial product array includes a plurality of booth encoders, a plurality of booth multiplexers and a carry-save adder array. Each of the booth encoders is configured to generate a particular select signal based on a portion of the multiplier operand. The booth encoders are arranged along a substantially diagonal path that extends between the high-side and the low-side. The booth multiplexers collectively generate a plurality of partial products based on the multiplicand operand and the select signals generated by the booth encoders. The carry-save adder array includes a plurality of compressor levels that are interleaved with the booth multiplexers. The compressor levels include a plurality of carry-save adders that are each arranged in an interleaved manner with the booth multiplexers.

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

A more complete understanding of the subject matter may be derived by referring to the detailed description and claims when considered in conjunction with the following figures, wherein like reference numbers refer to similar elements throughout the figures.

FIG. 1A is a logical block diagram illustrating a portion of an arithmetic processing unit in accordance with some of the disclosed embodiments.

FIG. 1B is a logical block diagram illustrating a partial product generator in accordance with one specific embodiment of the present disclosure.

FIG. 1C is a logical block diagram illustrating a compression tree architecture of a partial product reducer in accordance with one embodiment of the present disclosure.

FIG. 2 is a block diagram illustrating the physical layout of a partial product array including the arrangement of booth encoders, booth multiplexers, and compressors in accordance with a specific embodiment of the present disclosure.

FIG. 3 is a block diagram illustrating one example of the physical layout of booth multiplexers with respect to a carry-save adder in accordance with a specific embodiment of the present disclosure.

DETAILED DESCRIPTION

The following detailed description is merely illustrative in nature and is not intended to limit the embodiments of the subject matter or the application and uses of such embodiments. As used herein, the word “exemplary” means “serving as an example, instance, or illustration.” Any implementation described herein as exemplary is not necessarily to be construed as preferred or advantageous over other implementations. Furthermore, there is no intention to be bound by any expressed or implied theory presented in the preceding technical field, background, brief summary or the following detailed description.

Techniques and technologies may be described herein in terms of functional and/or logical block components and with reference to symbolic representations of operations, processing tasks, and functions that may be performed by various computing components or devices. It should be appreciated that the various block components shown in the figures may be realized by any number of hardware, software, and/or firmware components configured to perform the specified functions. For example, an embodiment of a system or a component may employ various integrated circuit components, e.g., memory elements, digital signal processing elements, logic elements, look-up tables, or the like, which may carry out a variety of functions under the control of one or more microprocessors or other control devices.

For the sake of brevity, conventional techniques related to functional aspects of the devices and systems (and the individual operating components of the devices and systems) may not be described in detail herein. Furthermore, the connecting lines shown in the various figures contained herein are intended to represent example functional relationships and/or physical couplings between the various elements. It should be noted that many alternative or additional functional relationships or physical connections may be present in an embodiment.

Definitions

As used herein, the term “instruction set architecture” refers to a part of the computer architecture related to programming, including the native data types, instructions, registers, addressing modes, memory architecture, interrupt and exception handling, and external I/O. An instruction set architecture includes a specification of a set of machine language “instructions.”

As used herein, the term “instruction” refers to an element of an executable program provided to a processor by a computer program that describes an operation that is to be performed or executed by the processor. An instruction may define a single operation of an instruction set. Types of operations include, for example, arithmetic operations, data copying operations, logical operations, and program control operations, as well as special operations, such as permute operations. A complete machine language instruction includes an operation code or “opcode” and, optionally, one or more operands.

As used herein, the term “operand” refers to the part of an instruction which specifies what data is to be manipulated or operated on, while at the same time also representing the data itself. In other words, an operand is the part of the instruction that references the data on which an operation (specified by an opcode) is to be performed. Operands may specify literal data (e.g., constants) or storage areas (e.g., addresses of registers or other memory locations in main memory) that contain data to be used in carrying out the instruction. As used herein, the term “opcode” refers to a portion of a machine language instruction that specifies or indicates which operation (or action) is to be performed by a processor on one or more operands. For example, an opcode may specify an arithmetic operation to be performed, such as “add contents of memory to register,” and may also specify the precision of the result that is desired. The specification and format for opcodes are defined in the instruction set architecture for a processor (which may be a general CPU or a more specialized processing unit).

As used herein, a “node” means any internal or external reference point, connection point, junction, signal line, conductive element, or the like, at which a given signal, logic level, voltage, data pattern, current, or quantity is present. Furthermore, two or more nodes may be realized by one physical element (and two or more signals can be multiplexed, modulated, or otherwise distinguished even though received or output at a common node).

The following description refers to elements or nodes or features being “connected” or “coupled” together. As used herein, unless expressly stated otherwise, “coupled” means that one element/node/feature is directly or indirectly joined to (or directly or indirectly communicates with) another element/node/feature, and not necessarily mechanically. Likewise, unless expressly stated otherwise, “connected” means that one element/node/feature is directly joined to (or directly communicates with) another element/node/feature, and not necessarily mechanically. In addition, certain terminology may also be used in the following description for the purpose of reference only, and thus is not intended to be limiting. For example, terms such as “first,” “second,” and other such numerical terms referring to elements or features do not imply a sequence or order unless clearly indicated by the context.

FIG. 1A is a block diagram illustrating a portion of an arithmetic processing unit 200 in accordance with some of the disclosed embodiments. FIG. 1A illustrates a particular example of one specific embodiment in which the arithmetic processing unit is configured to execute a floating point multiply-and-accumulate instruction in a packed single-precision mode as well as higher precision modes (e.g., where multiplier, multiplicand and addend operands have a “packed single” format and two single-precision operations are performed in parallel). However, in a more general sense, as will be explained below, in some implementations, the illustrated portion of the arithmetic processing unit 200 can be used to execute a multiply instruction. Execution of the multiply instruction involves digitally multiplying an N-bit wide multiplicand operand by an N-bit wide multiplier operand without performing an accumulate operation.

As will be appreciated by those skilled in the art, the arithmetic processing unit includes a mantissa module (or datapath) that performs mathematical operations on the mantissa portions of the received operands and an exponent module (or datapath) that performs mathematical operations on the exponent portions of the floating-point operands. The mantissa module and the exponent module perform their operations in a substantially parallel manner. For sake of clarity, FIG. 1A illustrates the mantissa datapath of the arithmetic processing unit 200 and highlights how the mantissa datapath is configured to execute two concurrent single-precision Fused Multiply-and-Accumulate (FMAC) operations. FIG. 1A does not illustrate the exponent datapath since it is not germane to the embodiments that will be described below.

Further, as will be appreciated by those skilled in the art, it is noted that a packed single format contains two individual single-precision values. The first (low) value includes a twenty-four bit mantissa that is right justified in the 64-bit operand field, and the second (high) value includes another twenty-four bit mantissa that is left justified in the 64-bit operand field, with sixteen zeros included between the two single-precision values. In this regard, it is noted that although the APU is illustrated as performing two single-precision operations in parallel using a “packed single” format, the APU can perform extended-precision, double-precision, and single-precision operations.
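The packed single mantissa layout described above can be illustrated with a short Python sketch. The function and constant names are hypothetical and chosen only for illustration; the sketch models the 64-bit operand field as an integer and is not intended to represent the register circuitry itself.

```python
MANTISSA_BITS = 24                      # width of each single-precision mantissa
GAP_BITS = 16                           # zeros separating the two mantissas
HIGH_SHIFT = MANTISSA_BITS + GAP_BITS   # high mantissa starts at bit 40
MANTISSA_MASK = (1 << MANTISSA_BITS) - 1


def pack_mantissas(high, low):
    """Place two 24-bit mantissas into one 64-bit operand field.

    The low mantissa is right justified (bits 0..23), the high mantissa is
    left justified (bits 40..63), and bits 24..39 are left as zeros.
    """
    if high & ~MANTISSA_MASK or low & ~MANTISSA_MASK:
        raise ValueError("each mantissa must fit in 24 bits")
    return (high << HIGH_SHIFT) | low


def unpack_mantissas(field):
    """Recover the (high, low) mantissas from a 64-bit operand field."""
    return (field >> HIGH_SHIFT) & MANTISSA_MASK, field & MANTISSA_MASK


packed = pack_mantissas(0xC00000, 0x800001)
print(hex(packed))
print([hex(m) for m in unpack_mantissas(packed)])
```

In this model, 24 + 16 + 24 bits account for the full 64-bit field, which is consistent with the sixteen zeros noted above.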

The illustrated portion of the arithmetic processing unit 200 includes operand registers 120, 122, and 124, result register 126, a partial product array 240, 245 that includes a partial product generator 240 and a partial product reducer 245 (that includes a carry-save adder (CSA) array 250 and a carry-save adder (CSA) 280), a sign control 260, a complement module 270 that includes portions 270-1 and 270-2, alignment modules 272, 274, 276, leading zero anticipator (LZA) modules 282, 284, 286, a carry-propagate adder (CPA) 290, normalizer modules 292, 293, and rounder modules 296, 297.

Operand Registers and Operand Format

Although not illustrated in FIG. 1A, the APU is coupled to a memory source that provides a multiplier operand, a multiplicand operand and optionally an addend operand. The memory source can be, for example, a plurality of register files or a memory bank.

Operand registers 120, 122, and 124 can each contain data values that can be floating point numbers having either a single-precision, double-precision, extended-precision or packed single-precision format. In the embodiments that will be described below, each operand register 120, 122, 124 contains two single-precision operands in a packed single-precision format (i.e., two individual single-precision values separated by zeros). An input A (not illustrated) can be provided to operand register 120, an input B (not illustrated) can be provided to operand register 122, an input C (not illustrated) can be provided to operand register 124, and an output result is provided to register 126. Inputs A and B are a multiplicand and a multiplier, respectively, and input C is an addend.

An instruction register (not illustrated) can contain an instruction (also referred to as an operation code and abbreviated as “opcode”), which identifies the instruction that is to be executed. The opcode specifies not only the arithmetic operation to be performed, but also the precision of the result that is desired. In accordance with some of the disclosed embodiments that will be described below, it is presumed that the instruction register provides a fused multiply-and-accumulate (FMAC) instruction/opcode so that a FMAC operation can be executed with respect to operands having a packed single-precision, floating point format.

A control module (not illustrated) has an input to receive an instruction from the instruction register and can provide control signals to the arithmetic processing unit 200 to perform a multiply operation or a multiply-and-accumulate operation. For example, the control module, upon receiving a fused multiply-and-accumulate (FMAC) instruction/opcode, can configure the portion of the arithmetic processing unit 200 to perform the indicated computation and to provide a packed single-precision result. Moreover, the control signal from the control module can configure the arithmetic processing unit 200 to interpret each of input values A, B, C as representing an operand of any of the supported precision modes, and more specifically, in this case, as representing operands of the packed single-precision mode.

These operands can be processed to perform a multiply operation or a multiply-and-accumulate operation as specified by an instruction to generate a result that is provided to result register 126. To execute a multiply-and-accumulate instruction, such as a floating-point multiply-and-accumulate operation, operands A and B are multiplied together to provide a product, and operand C is added to the product. A multiply instruction, such as a floating-point multiply (FMUL), is executed in substantially the same way except operand C is set to a value of zero.

As noted above, operand data can have a packed single-precision format in which the operand data is split into high and low portions or parts that are processed separately. Each input value A, B, C provided from operand registers 120, 122, 124 represents two individual single-precision operands, a “high” operand and a “low” operand. Because the arithmetic processing unit 200 is configured to execute two concurrent single-precision operations, and to illustrate this in FIG. 1A, operand register 120 includes portions 120-1 and 120-2, operand register 122 includes portions 122-1 and 122-2, operand register 124 includes portions 124-1 and 124-2, and result register 126 includes portions 126-1 and 126-2. Portion 120-1 of operand register 120 contains a single-precision value corresponding to a high-multiplicand operand (AH), and portion 120-2 of operand register 120 contains a single-precision value corresponding to a low-multiplicand operand (AL). Portion 122-1 of operand register 122 contains a single-precision value corresponding to a high-multiplier operand (BH), and portion 122-2 of operand register 122 contains a single-precision value corresponding to a low-multiplier operand (BL). Portion 124-1 of operand register 124 contains a single-precision value corresponding to a high-addend operand (CH), and portion 124-2 of operand register 124 contains a single-precision value corresponding to a low-addend operand (CL). During a FMAC calculation, the three high operands 120-1, 122-1, 124-1 can be used to compute a high result, (AH*BH)+CH=RH, and the three low operands 120-2, 122-2, 124-2 can be used to compute a low result, (AL*BL)+CL=RL.

In one implementation that is illustrated in FIG. 1A, the arithmetic processing unit can be implemented using five pipeline stages.

First and Second Pipeline Stages

The first pipeline stage and the second pipeline stage include the registers 230 and 232, a partial product array that includes the partial product generator 240 and the partial product reducer 245, the sign control 260, the complement module 270 that includes portions 270-1 and 270-2, and the alignment modules 272, 274, 276.

Although not illustrated in FIG. 1A, it is noted that during the first pipeline stage, the exponent data path calculates the exponent of the product, and the multiply operation begins. During the second pipeline stage, the exponent data path (not illustrated in FIG. 1A) compares exponents of the product and the addend, and selects the larger of the two as a preliminary exponent of the result.

As illustrated in FIG. 1A, operand register 120, operand register 122, and registers 230, 232 are coupled to partial product generator 240. During the first pipeline stage, the partial product generator 240 uses the multiplicand operands (AH, AL) 120-1, 120-2 and the multiplier operands (BH, BL) 122-1, 122-2 to generate thirty-two partial products 242-1 . . . 242-32 that are provided to CSA array 250, and to generate a thirty-third partial product 242-33 that is provided to CSA 280. In particular, partial product generator 240 uses the contents of register 120-2 to calculate 13 least significant partial products 242-0 . . . 242-12, and uses the contents of register 120-1 to calculate 13 most significant partial products 242-20 . . . 242-32. The middle eight partial products 242-13 . . . 242-19 can be calculated without using the value provided by either register 120-1 or 120-2.

FIG. 1B is a logical block diagram illustrating a partial product generator 240 in accordance with one specific embodiment of the present disclosure. The partial product generator 240 includes booth encoders 240-A-0 . . . 240-A-32 and corresponding booth multiplexers 240-B-0 . . . 240-B-32. In one non-limiting embodiment that will be described with reference to FIG. 1B for purposes of illustration, the multiplier operand includes sixty-four bits (b0 . . . b63), the multiplicand operand includes sixty-four bits (b0 . . . b63), and there are thirty-three each of the booth encoders 240-A-0 . . . 240-A-32 and the booth multiplexers 240-B-0 . . . 240-B-32.

Each of the booth encoders 240-A-0 . . . 240-A-32 is configured to perform booth encoding (e.g., a radix-4 Booth recoding) on a portion of the multiplier operand to generate particular select signals. Each of the booth encoders 240-A-0 . . . 240-A-32 receives three bits of the multiplier operand (or a “triplet of bits” of the multiplier operand) that serve as control signals. Between any two consecutive booth encoders 240-A-0 . . . 240-A-32 there is a one bit overlap between the triplet of bits they receive. For instance, the first booth encoder 240-A-0 would receive bit b0 of the multiplier operand along with padding bits 0, 0, the second booth encoder 240-A-1 would receive bits b0, b1, b2 of the multiplier operand, the third booth encoder 240-A-2 would receive bits b2, b3, b4 of the multiplier operand, the fourth booth encoder (not illustrated in FIG. 1B) would receive bits b4, b5, b6 of the multiplier operand, and so on. With respect to the first booth encoder 240-A-0 it is noted that it receives only bit b0 of the multiplier operand along with padding bits 0, 0 since this is required when scanning the multiplier to correctly generate the halved number of possibly positive or negative 1× and 2× multiplicand multiples.
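For purposes of illustration only, the following Python sketch shows a textbook radix-4 Booth recoding of an unsigned multiplier into overlapping triplets. It is an assumption-laden model rather than the encoder circuitry of FIG. 1B: the function name is hypothetical, and the sketch uses the conventional bit assignment (one padding zero below b0, triplets of b[2i+1], b[2i], b[2i-1]), which may differ from the exact bit indexing recited above.

```python
def booth_radix4_digits(multiplier, width=64):
    """Recode an unsigned multiplier into radix-4 Booth digits in {-2..2}.

    Returns width // 2 + 1 digits, least significant first, such that
    sum(digit * 4**i) equals the multiplier.
    """
    if multiplier < 0 or multiplier >> width:
        raise ValueError("multiplier must be an unsigned value of 'width' bits")
    padded = multiplier << 1            # append the padding zero below b0
    digits = []
    for i in range(width // 2 + 1):
        triplet = (padded >> (2 * i)) & 0b111   # consecutive triplets share one bit
        high = (triplet >> 2) & 1
        mid = (triplet >> 1) & 1
        low = triplet & 1
        digits.append(mid + low - 2 * high)     # digit value selected by the triplet
    return digits


digits = booth_radix4_digits(0x0123456789ABCDEF)
print(len(digits), digits)
```

For a 64-bit multiplier this model yields thirty-three digits, matching the thirty-three booth encoders described above, and each digit corresponds to a zero, 1× or 2× multiple of the multiplicand that is possibly negated.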

Each of the booth encoders 240-A-0 . . . 240-A-32 operates in parallel and performs booth encoding on the triplet of bits that it receives to generate the particular select signals 241. A first signal can be used to instruct a corresponding booth multiplexer to select a 1× multiple of the bits of the multiplicand operand, a second signal can be used to instruct a corresponding booth multiplexer to select a 2× multiple of the bits of the multiplicand operand, and a third signal can be used to instruct a corresponding booth multiplexer to select a complement/inversion of the multiple chosen by the first or second select signal. As such, depending on which of the first, second and/or third select signals are enabled, a corresponding booth multiplexer can be instructed to select either: (1) a 1× multiple of the bits of the multiplicand operand, or (2) a complement/inversion of the 1× multiple of the bits of the multiplicand operand, or (3) a 2× multiple of the bits of the multiplicand operand, or (4) a complement of the 2× multiple of the bits of the multiplicand operand.
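One possible way to derive the three select signals from a triplet of bits is sketched below in Python. The signal names (sel_1x, sel_2x, negate) and both function names are hypothetical, and the Boolean expressions follow the conventional radix-4 Booth truth table; the actual encoder logic of the disclosed embodiments may be implemented differently.

```python
def booth_select_signals(high, mid, low):
    """Map one triplet of multiplier bits to (sel_1x, sel_2x, negate).

    high, mid and low are the three bits of the triplet, most significant
    first, each given as 0 or 1.
    """
    sel_1x = mid != low                        # triplets 001, 010, 101, 110
    sel_2x = (high != mid) and (mid == low)    # triplets 011 and 100
    negate = high == 1 and not (mid and low)   # 100, 101, 110 (111 selects zero)
    return sel_1x, sel_2x, negate


def selected_multiple(sel_1x, sel_2x, negate):
    """Return the signed multiple of the multiplicand chosen by the signals."""
    magnitude = 1 if sel_1x else 2 if sel_2x else 0
    return -magnitude if negate else magnitude


for bits in range(8):
    h, m, l = (bits >> 2) & 1, (bits >> 1) & 1, bits & 1
    print(f"{h}{m}{l} ->", selected_multiple(*booth_select_signals(h, m, l)))
```

When neither the first nor the second signal is enabled, this model selects a zero multiple, which occurs for the all-zeros and all-ones triplets.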

Each of the booth multiplexers 240-B-0 . . . 240-B-32 is configured to receive a particular version of the multiplicand operand, and one particular select signal generated by a corresponding one of the booth encoders 240-A-0 . . . 240-A-32. Each of the booth multiplexers 240-B-0 . . . 240-B-32 operates in parallel on its respective inputs to generate a partial product (such that the booth multiplexers 240-B-0 . . . 240-B-32 collectively generate N partial products). For example, in one implementation that is illustrated in FIG. 1B, there are thirty-three booth multiplexers 240-B-0 . . . 240-B-32 that can generate a total of thirty-three partial products PP1 . . . PP33.

In one implementation, each of the booth multiplexers 240-B-0 . . . 240-B-32 receives all bits of the multiplicand; however, at each of booth multiplexers 240-B-1 . . . 240-B-32 (except for the first booth multiplexer 240-B-0) the bits of the multiplicand are shifted left by two places (or bit positions). In other words, a first booth multiplexer 240-B-0 receives the multiplicand operand, and each of the other booth multiplexers 240-B-1 . . . 240-B-32 receives a particular bit-shifted version of the multiplicand operand that is shifted by two bits with respect to the prior one. As such, between any two consecutive ones of the booth multiplexers 240-B-1 . . . 240-B-32, the version of the multiplicand that they receive is shifted by two bits with respect to the one received by the prior booth multiplexer. For instance, the first booth multiplexer 240-B-0 would receive an unshifted version of the multiplicand operand, the second booth multiplexer 240-B-1 would receive a version of the multiplicand operand that is shifted by two bits, the third booth multiplexer 240-B-2 would receive a version of the multiplicand operand that is shifted by four bits, the fourth booth multiplexer 240-B-3 would receive a version of the multiplicand operand that is shifted by six bits, and so on.
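The combination of select signals and progressively shifted versions of the multiplicand can be modeled, under stated assumptions, by the Python sketch below. The function name is hypothetical, the recoding uses the conventional radix-4 bit assignment, and negative multiples are represented as signed Python integers rather than as the inverted bit vectors a hardware multiplexer would produce.

```python
def booth_partial_products(multiplicand, multiplier, width=64):
    """Generate width // 2 + 1 Booth partial products for unsigned operands.

    The multiplexer at position i is modeled as receiving the multiplicand
    shifted left by 2 * i bits and selecting 0, 1x or 2x of it, possibly
    negated, according to the i-th triplet of the multiplier.
    """
    padded = multiplier << 1                      # padding zero below b0
    partial_products = []
    for i in range(width // 2 + 1):
        triplet = (padded >> (2 * i)) & 0b111
        high, mid, low = (triplet >> 2) & 1, (triplet >> 1) & 1, triplet & 1
        digit = mid + low - 2 * high              # value in {-2, -1, 0, 1, 2}
        shifted = multiplicand << (2 * i)         # two more bits per multiplexer
        partial_products.append(digit * shifted)
    return partial_products


a, b = 0xFEDCBA9876543210, 0x0123456789ABCDEF
pps = booth_partial_products(a, b)
print(len(pps), sum(pps) == a * b)
```

Summing the partial products recovers the full product, which is the property that the downstream compression tree relies on.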

Partial product generator 240 is coupled to a partial product reducer 245 (or “carry-save adder Wallace tree”) that includes a CSA array 250 and a CSA 280. The partial products 242-1 . . . 242-33 are provided to the partial product reducer 245. In particular, thirty-two partial products 242-1 . . . 242-32 are provided to CSA array 250, and a thirty-third partial product 242-33 is provided to CSA 280. In general, the partial product reducer 245 is used to compress the thirty-three partial products 242-1 . . . 242-33 to generate intermediate carry 288 and sum 289 results. Further details regarding the partial product reducer 245 in accordance with the disclosed embodiments will be described below with reference to FIG. 1C.

Portion 124-1 of operand register 124 is coupled to portion 270-1 of complement module 270, and portion 124-2 of operand register 124 is coupled to portion 270-2 of complement module 270. Sign control 260 is also coupled to the complement modules 270-1, 270-2. If the sign control 260 indicates that an effective subtract is being computed, portions 270-1 and 270-2 will “flip” the bits of their input to produce the output. The outputs of complement module 270 portions 270-1, 270-2 are coupled to alignment module 272.

FIG. 1A illustrates other details regarding the third pipeline stage, the fourth pipeline stage and the fifth pipeline stage that are not germane to the disclosed embodiments, and therefore are not described herein for sake of clarity. Further information regarding the third pipeline stage, the fourth pipeline stage and the fifth pipeline stage is described, for example, in U.S. patent application Ser. No. 13/172,590, filed Jun. 29, 2011, entitled “METHODS AND APPARATUS FOR COMPRESSING PARTIAL PRODUCTS DURING A FUSED MULTIPLY-AND-ACCUMULATE (FMAC) OPERATION ON OPERANDS HAVING A PACKED-SINGLE-PRECISION FORMAT,” and assigned to the assignee of the present invention, which is incorporated herein by reference in its entirety.

FIG. 1C is a logical block diagram illustrating a compression tree architecture of a partial product reducer 245 that is configured to operate in accordance with a specific embodiment of the present disclosure. As mentioned above, the partial product reducer 245 includes the carry-save adder (CSA) array 250 coupled to the CSA 280. Although not illustrated in FIG. 1C, in some embodiments, the partial product reducer 245 can include other elements such as storage elements (flip-flops), additional compressor levels that include additional carry-save adders, etc. For sake of clarity, these additional elements are not illustrated. In general, it is noted that the partial product reducer 245 receives partial products (PP1 . . . PP33) 242-1 . . . 242-33 as its inputs. For purposes of illustration, in the following example, which illustrates one exemplary implementation, it is assumed that there are thirty-three partial products 242-1 . . . 242-33.

The carry-save adder (CSA) array 250 includes a plurality of carry-save adders 310-A . . . 310-H, 320-A . . . 320-D, 330-A, 330-B, 340 that are distributed over four compressor levels (LEVELS 1 . . . 4). The carry-save adders 310-A . . . 310-H, 320-A . . . 320-D, 330-A, 330-B, 340 and the CSA 280 are used to reduce the thirty-two partial products 242-1 . . . 242-32 and the thirty-third partial product 242-33 to a 129-bit carry vector 351-1 and a 128-bit sum vector 351-2 that represent the sum of the 33 partial products. In this particular implementation, the compressor levels (LEVEL 1, LEVEL 2) are part of the first pipeline stage, and compressor levels (LEVEL 3, LEVEL 4) are part of the second pipeline stage.

The first compressor level (LEVEL 1) includes eight 4:2 carry-save adders 310-A . . . 310-H that are configured to compress the plurality of partial products (PP1 . . . PP32) except for a last partial product (PP33) to generate first outputs. As such, in the first compressor level (LEVEL 1), each of the 4:2 carry-save adders 310-A . . . 310-H receives four partial products 242 and compresses them to generate a carry output and a sum output. Each of the partial products 242-1 . . . 242-32 that is input to the partial product reducer 245 is a bit vector that includes, for example, 73 bits (or is “73 bits wide”) in one exemplary implementation. For instance, 4:2 carry-save adder 310-A receives four partial products 242-1 . . . 242-4 and compresses them to generate a carry output 0 and a sum output 0 (e.g., two output vectors whose sum is equal to the sum of the PP1 . . . PP4 inputs 242-1 . . . 242-4), whereas 4:2 carry-save adder 310-B receives four partial products 242-5 . . . 242-8 and compresses them to generate a carry output 1 and a sum output 1. Each of the carry and sum outputs generated by the 4:2 carry-save adders 310-A . . . 310-H in the first compressor level (LEVEL 1) can be, for example, 81 bits (or is “81 bits wide”) in one exemplary implementation.
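The behavior of a 4:2 carry-save adder, namely producing two output vectors whose sum equals the sum of its four inputs, can be sketched in Python as follows. This is a functional model only: it assumes the 4:2 compressor is built from two chained 3:2 carry-save adders, which is one common construction and not necessarily the circuit used in the disclosed embodiments, and the function names are hypothetical.

```python
def csa_3to2(a, b, c):
    """3:2 carry-save adder on non-negative integers used as bit vectors.

    Returns (carry, sum) with carry + sum == a + b + c; no carry is
    propagated between bit positions.
    """
    sum_bits = a ^ b ^ c                             # per-bit sum
    carry_bits = ((a & b) | (a & c) | (b & c)) << 1  # per-bit majority, weight 2
    return carry_bits, sum_bits


def compress_4to2(a, b, c, d):
    """4:2 compressor modeled as two chained 3:2 carry-save adders."""
    carry1, sum1 = csa_3to2(a, b, c)
    return csa_3to2(sum1, carry1, d)


carry0, sum0 = compress_4to2(0b1101, 0b0111, 0b10110, 0b1001)
print(carry0 + sum0 == 0b1101 + 0b0111 + 0b10110 + 0b1001)
```

Because each output bit depends only on a few input bits at the same position, the delay of one compressor level is independent of the operand width in this model, which is the usual motivation for carry-save reduction.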

The second compressor level (LEVEL 2) includes four 4:2 carry-save adders 320-A . . . 320-D that are configured to compress the first outputs (generated by the first compressor level 310) to generate second outputs. Each of the 4:2 carry-save adders 320-A . . . 320-D receives two carry inputs and two sum inputs, and compresses them to generate a carry output and a sum output. For instance, 4:2 carry-save adder 320-A receives carry output 0, sum output 0, carry output 1 and sum output 1, and compresses them to generate a carry output 8 and a sum output 8, whereas 4:2 carry-save adder 320-B receives carry output 2, sum output 2, carry output 3 and sum output 3 and compresses them to generate a carry output 9 and a sum output 9. Each of the carry and sum outputs generated by the carry-save adders 320-A . . . 320-D in the second compressor level (LEVEL 2) can be, for example, 97 bits (or is “97 bits wide”) in one exemplary implementation.

The third compressor level (LEVEL 3) includes two 4:2 carry-save adders 330-A, 330-B that are configured to compress the second outputs (generated by the second compressor level 320) to generate third outputs. Each of the 4:2 carry-save adders 330-A, 330-B receives two carry inputs and two sum inputs, and compresses them to generate a carry output and a sum output. For instance, 4:2 carry-save adder 330-A receives carry output 8, sum output 8, carry output 9 and sum output 9, and compresses them to generate a carry output 12 and a sum output 12, whereas 4:2 carry-save adder 330-B receives carry output 10, sum output 10, carry output 11 and sum output 11 and compresses them to generate a carry output 13 and a sum output 13. Each of the carry and sum outputs generated by the carry-save adders 330-A, 330-B in the third compressor level (LEVEL 3) can be, for example, 130 bits (or is “130 bits wide”) in one exemplary implementation.

The fourth compressor level (LEVEL 4) includes a 4:2 carry-save adder 340 that is configured to compress the third outputs (generated by the third compressor level 330) to generate a fourth carry output 251-1 and a fourth sum output 251-2. For instance, the 4:2 carry-save adder 340 receives carry output 12, sum output 12, carry output 13 and sum output 13 and compresses them to generate a carry output 14 and a sum output 14. The carry output 14 and sum output 14 generated by the 4:2 carry-save adder 340 in the fourth compressor level (LEVEL 4) can be, for example, 128 bits (or is “128 bits wide”) in one exemplary implementation.

The fifth compressor level (LEVEL 5) includes CSA 280. CSA 280 is another 4:2 carry-save adder coupled to the CSA array 250. CSA 280 receives carry output 14 and sum output 14 that were generated by the 4:2 carry-save adder 340, together with the partial product (PP33) 242-33, and compresses these inputs to generate a carry output 288 (that can be 130 bits wide in one exemplary implementation) and a sum output 289. It is noted that the aligned addend 277 generated by the alignment modules 274 is only used by the CSA 280 when executing a multiply and add operation, and therefore is not used in this implementation.
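The five compressor levels described above can be sketched in software as follows. This is a minimal behavioral sketch under stated assumptions, not the disclosed circuit: bit vectors are modeled as unbounded non-negative Python integers (so the 73-bit to 130-bit widths and any sign handling are not modeled), each 4:2 compressor is built from two 3:2 carry-save stages (one common construction, assumed here), and the function names are hypothetical.

```python
import random


def csa_3to2(a, b, c):
    """3:2 carry-save adder on bit vectors held as non-negative ints."""
    s = a ^ b ^ c                                # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1   # majority bits, shifted left
    return carry, s


def csa_4to2(a, b, c, d):
    """4:2 compressor assumed to be two cascaded 3:2 stages."""
    c1, s1 = csa_3to2(a, b, c)
    return csa_3to2(s1, c1, d)


def reduce_partial_products(pps, addend=0):
    """Reduce 33 partial products to one carry/sum pair.

    Levels 1-4 compress PP1..PP32; the fifth level adds PP33 and,
    for a multiply-and-add, an aligned addend (0 otherwise).
    Returns (carry, sum, number_of_compressors_per_level).
    """
    assert len(pps) == 33
    level = list(pps[:32])
    per_level = []
    while len(level) > 2:
        nxt = []
        for i in range(0, len(level), 4):
            carry, s = csa_4to2(*level[i:i + 4])
            nxt += [carry, s]
        per_level.append(len(level) // 4)
        level = nxt
    carry14, sum14 = level
    carry_out, sum_out = csa_4to2(carry14, sum14, pps[32], addend)
    return carry_out, sum_out, per_level


if __name__ == "__main__":
    rng = random.Random(0)
    pps = [rng.getrandbits(73) for _ in range(33)]
    carry, s, per_level = reduce_partial_products(pps)
    print(per_level, carry + s == sum(pps))
```

In this model the invariant at every level is that the carry and sum outputs of a compressor add up to the sum of its four inputs, which is why the final carry/sum pair can be handed to a conventional adder to obtain the result.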

One drawback of a conventional APU architecture or layout used to implement a multiplier module is that the booth encoder circuitry is arranged at the periphery of the multiplier module. As a result, the signals communicated from the booth encoders must travel a significant distance before reaching the booth multiplexers, and the wiring path lengths between the booth encoders and corresponding booth multiplexers are significant in this layout. In addition, the partial products generated by the booth multiplexers must also travel a significant distance to reach the CSA circuitry that is implemented at various different compressor levels (when the CSA circuitry is physically separated from the booth multiplexers). These long wiring paths increase propagation delay and can require additional driver circuitry, which in turn increases layout area and power consumption.

It would be desirable to reduce the layout area of the APU so that it has a compact layout. It would also be desirable if performance of the APU can be improved, and if power consumed and dissipated by the APU can be reduced.

In accordance with the disclosed embodiments, a split or folded multiplier architecture is provided in which the booth encoder circuitry, booth multiplexer circuitry and compressor circuitry have an optimized layout.

FIG. 2 is a block diagram illustrating the physical layout of a partial product array 240, 245 including the arrangement of booth encoders 240-A, booth multiplexers 240-B, and compressors 280, 310, 320, 330, 340 in accordance with a specific embodiment of the present disclosure.

As illustrated in FIG. 2, the partial product array 240, 245 has a split or folded layout that is split into a low-side 302 and a high-side 304. Splitting the layout into the low-side 302 and the high-side 304 enables the multiplier to fit into a rectangular physical footprint, which allows for efficient and compact integration in the datapath of a microprocessor die. The high side of the partial product array, with redundant partial products on the high side for bits that spill over from the compression, is effectively slid under the low side to accomplish this task.

In FIG. 2, the partial products PP1 . . . PP33 (that are generated by the various booth multiplexers 240-B-0 . . . 240-B-32 and processed by the various carry-save adders 310-A . . . 310-H, 320-A . . . 320-D, 330-A, 330-B, 340) are indicated along the top and bottom of the diagram. As described above, the various carry-save adders 310-A . . . 310-H, 320-A . . . 320-D, 330-A, 330-B, 340 make up the compressor levels of the carry-save adder array 250.

As described above, bits of the multiplier operand are communicated from a memory source to the booth encoders 240-A. The booth encoders 240-A then each use a portion of the multiplier operand to generate select signals that are then routed north 202 and south 204 (along the columns in the direction of arrows 202, 204) to the booth multiplexers 240-B that are arranged above and below the booth encoders.
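The encoder and multiplexer roles described above can be illustrated with a short behavioral sketch of radix-4 booth recoding. The sketch assumes an unsigned 64-bit multiplier operand, overlapping triplets of multiplier bits, and a select value in {-2, -1, 0, +1, +2}; the zero selection and the function names are assumptions made for illustration and are not taken from the figures. Negative selections are modeled with signed Python integers rather than with complemented bit vectors.

```python
import random


def booth_selects(multiplier, bits=64):
    """Radix-4 booth recoding of an unsigned `bits`-wide multiplier.

    Returns bits // 2 + 1 select values (33 for 64 bits), least
    significant first, each in {-2, -1, 0, 1, 2}.
    """
    assert 0 <= multiplier < (1 << bits)
    padded = multiplier << 1          # implicit 0 to the right of bit 0
    selects = []
    for i in range(bits // 2 + 1):
        triplet = (padded >> (2 * i)) & 0b111
        low = triplet & 1             # multiplier bit 2i-1
        mid = (triplet >> 1) & 1      # multiplier bit 2i
        high = (triplet >> 2) & 1     # multiplier bit 2i+1
        selects.append(low + mid - 2 * high)
    return selects


def booth_partial_products(multiplicand, multiplier, bits=64):
    """Model of the booth multiplexers: select 0, +/-1x or +/-2x the
    multiplicand and weight it by 4**i (a left shift by 2*i)."""
    return [sel * multiplicand << (2 * i)
            for i, sel in enumerate(booth_selects(multiplier, bits))]


if __name__ == "__main__":
    rng = random.Random(0)
    x, y = rng.getrandbits(64), rng.getrandbits(64)
    pps = booth_partial_products(x, y)
    print(len(pps), sum(pps) == x * y)
```

Under these assumptions a 64-bit multiplier operand yields 33 select values, which is consistent with the 33 partial products PP1 . . . PP33 discussed here, and the partial products sum to the product of the two operands.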

In accordance with the disclosed embodiments, the plurality of booth encoders 240-A-0 . . . 240-A-32 are arranged along a substantially diagonal path that extends between the low-side 302 and the high-side 304.

This layout allows for a reduced distance wiring path between the externally sourced multiplier input and booth encoders 240-A-0 . . . 240-A-32. In one implementation, the booth encoder circuitry is arranged between the low and high sides of the split/folded multiplier module with a substantially diagonal layout (e.g., diagonally embedded inside an M (e.g., 64) bit by N (e.g., 64) bit multiplier module). By arranging the booth encoders in this manner, the wiring path lengths between the externally sourced multiplier input and the booth encoders can be reduced and area for additional driver circuitry can be reduced.

In addition, the carry-save adders 310-A . . . 310-H, 320-A . . . 320-D, 330-A, 330-B, 340 are each arranged in an interleaved manner among the booth multiplexers 240-B. This arrangement essentially seeds the placement of the booth multiplexers 240-B so that the routing distance for the signals between the booth multiplexers 240-B and the first compressor stage, and the routing distance for the signals to all successive compressor stages, is minimal for the Wallace tree type adder.

FIG. 3 is a block diagram illustrating an example of a physical layout of booth multiplexers 240-B-0 . . . 240-B-3 with respect to a carry-save adder 310-A in accordance with a specific embodiment of the present disclosure. The booth multiplexers 240-B-0 . . . 240-B-3 communicate the partial products to the CSAs over interconnections between the circuits. As illustrated in FIG. 3, carry-save adder 310-A is centered among the booth multiplexers 240-B-0 . . . 240-B-3. Other variations of this are illustrated in FIG. 2.

In accordance with the disclosed embodiments, the booth multiplexers are interleaved with CSAs of the various compressor levels that are used to reduce the partial products. By interleaving the compressor circuitry with the booth multiplexer circuitry, the wiring path lengths that the partial products must travel over to reach the CSAs of the various compressor levels can be reduced in comparison to conventional approaches. In aggregate, this interleaved arrangement helps to minimize the total distance traveled by the partial products PP1 . . . PP33 as they travel from the booth multiplexers 240-B to the carry-save adders 310-A . . . 310-H. The arrangement of the carry-save adders 310-A . . . 310-H with respect to the carry-save adders 320-A . . . 320-D also helps minimize the total wiring distance traveled by the carry and sum outputs of the carry-save adders 310-A . . . 310-H as they travel to the carry-save adders 320-A . . . 320-D. The arrangement of the carry-save adders 320-A . . . 320-D with respect to carry-save adders 330-A, 330-B also helps minimize the total wiring distance traveled by the carry and sum outputs of the carry-save adders 320-A . . . 320-D as they travel to the carry-save adders 330-A, 330-B. The arrangement of the carry-save adders 330-A, 330-B with respect to carry-save adder 340 also helps minimize the total wiring distance traveled by the carry and sum outputs of the carry-save adders 330-A, 330-B as they travel to the carry-save adder 340. Because the carry-save adders are arranged in such an interleaved manner among the booth multiplexers 240-B, this can also help to reduce the layout area occupied by this circuitry along with the propagation delay through this circuitry.

In the particular non-limiting implementation that is illustrated in FIG. 2, each one of the carry-save adders 310-A . . . 310-H is arranged halfway between and coupled to two pairs of the booth multiplexers 240-B to receive the four partial products from the two pairs of the booth multiplexers 240-B. Similarly, the carry-save adders 320-A . . . 320-D are arranged between two pairs of the booth multiplexers 240-B, and each one of the carry-save adders 320-A . . . 320-D is also arranged halfway between and coupled to two of the carry-save adders 310 such that any particular carry-save adder 320 is configured to receive first outputs from the two carry-save adders 310 that it is coupled to. In addition, carry-save adders 330-A, 330-B are arranged between two pairs of the booth multiplexers 240-B (where each of the two pairs is arranged between a pair of the carry-save adders 310). The carry-save adders 330-A, 330-B are also arranged halfway between and coupled to two of the carry-save adders 320 such that each one of the carry-save adders 330-A, 330-B is configured to receive second outputs from the two carry-save adders 320 that it is coupled to. Finally, the carry-save adder 340 is also arranged between two pairs of the booth multiplexers 240-B, and is arranged halfway between and coupled to the carry-save adders 330 such that the carry-save adder 340 receives the third outputs from the carry-save adders 330 that it is coupled to. Some of the advantages of arranging the carry-save adders of the various compressor levels among the booth multiplexers 240-B in this manner are explained above.
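The effect of the halfway placement described above on wiring distance can be illustrated with a deliberately simplified one-dimensional toy model. Everything in this sketch is an assumption made for illustration: the 32 booth multiplexers feeding the first four levels are placed at unit-spaced slots, each compressor sits at the midpoint of the circuits that feed it, each connection is counted once regardless of how many bits it carries, and the comparison layout simply places every compressor at one edge of the multiplexer row. It does not model the two-dimensional folded layout of FIG. 2, and the function names are hypothetical.

```python
def midpoint(positions):
    """Average position of a group of source circuits."""
    return sum(positions) / len(positions)


def interleaved_levels(n_mux=32):
    """Level 0 holds the mux slots; each later level holds compressor
    positions placed halfway between the circuits that feed them."""
    levels = [[float(i) for i in range(n_mux)]]
    fan_in = 4                       # level 1 takes four partial products
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([midpoint(prev[i:i + fan_in])
                       for i in range(0, len(prev), fan_in)])
        fan_in = 2                   # later levels take two carry/sum pairs
    return levels


def peripheral_levels(n_mux=32):
    """Comparison layout (assumed): all compressors at the row edge."""
    shape = interleaved_levels(n_mux)
    return [shape[0]] + [[float(n_mux)] * len(lv) for lv in shape[1:]]


def wire_length(levels):
    """Total source-to-compressor distance, one unit per slot."""
    total, fan_in = 0.0, 4
    for prev, cur in zip(levels, levels[1:]):
        for k, pos in enumerate(cur):
            sources = prev[k * fan_in:(k + 1) * fan_in]
            total += sum(abs(pos - p) for p in sources)
        fan_in = 2
    return total


if __name__ == "__main__":
    print(wire_length(interleaved_levels()), wire_length(peripheral_levels()))
```

Within this toy model the interleaved placement gives a shorter total connection length than the edge placement, which is the qualitative trend the halfway arrangement is intended to achieve; the absolute numbers have no physical meaning.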

In accordance with some implementations of the disclosed embodiments, including those illustrated in FIGS. 1A and 1C when the APU 200 is configured to implement a multiply-and-accumulate operation, the partial product reducer 245 can also include a carry-save adder 280, that serves as a fifth compressor level, coupled to the carry-save adder 340 of the carry-save adder array 250. In other words, the carry-save adder 280 is implemented in embodiments when the arithmetic processing unit 200 is configured to perform a multiply-and-accumulate operation on the operands. This carry-save adder 280 is arranged in an interleaved manner between one pair of the booth multiplexers 240-B-30, 240-B-31 and a final booth multiplexer 240-B-32. The carry-save adder 280 is configured to generate result outputs 288, 289 based on the partial product (PP33) 242-33, an aligned addend 277, the fourth carry output 251-1 and the fourth sum output 251-2. The result outputs comprise a carry result output 288 and a sum result output 289.

Conclusion

Thus, in accordance with the disclosed embodiments, a split or folded architecture is provided that has an optimized booth encoder layout along with interleaved booth multiplexers and compressors. This allows for the propagation delay through the multiplier, as well as the layout area of the partial product array (that includes the partial product generator 240 and the partial product reducer 245) to be reduced. The improved layout reduces wiring path lengths between the multiplier input and booth encoders, as well as the wiring path lengths between the booth multiplexers and compressor (or CSA) circuitry and also between all compressor circuitry stages. Additional circuitry that would normally be required to drive signals over greater distances can therefore be removed. In addition, propagation delay is reduced, while die area and power consumed can also be lowered.

While at least one exemplary embodiment has been presented in the foregoing detailed description, it should be appreciated that a vast number of variations exist. It should also be appreciated that the exemplary embodiment or embodiments described herein are not intended to limit the scope, applicability, or configuration of the claimed subject matter in any way. Rather, the foregoing detailed description will provide those skilled in the art with a convenient road map for implementing the described embodiment or embodiments. It should be understood that various changes can be made in the function and arrangement of elements without departing from the scope defined by the claims, which includes known equivalents and foreseeable equivalents at the time of filing this patent application.

What is claimed is:
 1. A multiplier module for digitally multiplying a multiplicand operand by a multiplier operand, comprising: a partial product array having a folded layout that is split into a low-side and a high-side, comprising: a plurality of booth encoders that are each configured to generate a particular select signal based on a portion of the multiplier operand, wherein the plurality of booth encoders are arranged along a substantially diagonal path that extends between the high-side and the low-side; a plurality of booth multiplexers that collectively generate a plurality of partial products based on the multiplicand operand and the select signals generated by the booth encoders; and a carry-save adder array comprising: a plurality of compressor levels that are interleaved with the plurality of booth multiplexers.
 2. A multiplier module according to claim 1, wherein the plurality of compressor levels comprise: n compressor levels including a first through nth compressor levels, wherein each of the n compressor levels comprise at least one carry-save adder, and wherein each carry-save adder of a nth compressor level is interleaved between carry-save adders of a (n−1)th compressor level, and wherein each of the carry-save adders are arranged in an interleaved manner between at least two booth multiplexers.
 3. A multiplier module according to claim 2, wherein the plurality of compressor levels comprise: a first compressor level comprising: a plurality of first carry-save adders that are configured to compress the plurality of partial products except for a last partial product to generate first outputs; a second compressor level comprising a plurality of second carry-save adders that are configured to compress the first outputs to generate second outputs; a third compressor level comprising a plurality of third carry-save adders that are configured to compress the second outputs to generate third outputs; and a fourth compressor level comprising a fourth carry-save adder that is configured to compress the third outputs to generate a fourth carry output and a fourth sum output, wherein each one of the plurality of first carry-save adders are arranged in an interleaved manner among the booth multiplexers, wherein each one of the plurality of second carry-save adders are arranged in an interleaved manner among the booth multiplexers, wherein each one of the plurality of third carry-save adders are arranged in an interleaved manner among the booth multiplexers, and wherein the fourth carry-save adder is arranged among the booth multiplexers.
 4. A multiplier module according to claim 3, wherein each one of the plurality of first carry-save adders are arranged between two pairs of the booth multiplexers and coupled to the two pairs of the booth multiplexers to receive the partial products from the two pairs of the booth multiplexers that that particular first carry-save adder is coupled to.
 5. A multiplier module according to claim 3, wherein each one of the plurality of second carry-save adders are arranged between two pairs of the booth multiplexers, wherein each one of the plurality of second carry-save adders are arranged between and coupled to two of the first carry-save adders such that that particular second carry-save adder is configured to receive first outputs from the two of the first carry-save adders that it is coupled to.
 6. A multiplier module according to claim 5, wherein each one of the plurality of third carry-save adders are arranged between two pairs of the booth multiplexers, wherein each one of the plurality of third carry-save adders are arranged between and coupled to two of the second carry-save adders such that that particular third carry-save adder is configured to receive second outputs from the two of the second carry-save adders that it is coupled to.
 7. A multiplier module according to claim 6, wherein the fourth carry-save adder is arranged between two pairs of the booth multiplexers, wherein the fourth carry-save adder is arranged between and coupled to the third carry-save adders such that the fourth carry-save adder is configured to receive the third outputs from the third carry-save adders that it is coupled to.
 8. A multiplier module according to claim 1, wherein each of the plurality of booth encoders is configured to perform booth encoding on a portion of the multiplier operand to generate a particular select signal.
 9. A multiplier module according to claim 8, wherein each of the plurality of booth encoders is configured to receive a triplet of bits of the multiplier operand that serve as control signals, and perform booth encoding on that triplet of bits to generate the particular select signal, wherein each particular select signal is used to instruct a corresponding booth multiplexer to select one of: (1) a 1× multiple of the bits of the multiplicand operand, (2) a 2× multiple of the bits of the multiplicand operand, (3) a complement of the 1× multiple of the bits of the multiplicand operand, or (4) a complement of the 2× multiple of the bits of the multiplicand operand.
 10. A multiplier module according to claim 1, wherein each of the plurality of booth multiplexers is configured to: receive inputs comprising: a particular version of the multiplicand operand, and one of the particular select signals generated by one of the booth encoders; and wherein each of the booth multiplexers operate in parallel on their respective inputs to generate a partial product such that the plurality of booth multiplexers collectively generate the plurality of partial products.
 11. A processor configured to process a multiplicand operand and a multiplier operand, comprising: a partial product array having a folded layout that is split into a low-side and a high-side, comprising: a partial product generator: a plurality of booth encoders that are arranged along a substantially diagonal path that extends between the high-side and the low-side, wherein each of the plurality of booth encoders is configured to perform booth encoding on a portion of the multiplier operand to generate a particular select signal; and a plurality of booth multiplexers, wherein each of the booth multiplexers is configured to receive a particular version of the multiplicand operand, and one set of the particular select signals generated by one of the booth encoders, and to generate a partial product such that the plurality of booth multiplexers collectively generate a plurality of partial products; a partial product reducer configured to receive and reduce the plurality of partial products to generate result outputs, the partial product reducer comprising: a carry-save adder array comprising a plurality of compressor levels that are interleaved with the plurality of booth multiplexers.
 12. A processor according to claim 11, wherein the plurality of compressor levels of the carry-save adder array, comprise: a plurality of carry-save adders that are each arranged in an interleaved manner with the plurality of booth multiplexers.
 13. A processor according to claim 12, wherein the plurality of compressor levels comprise: a first compressor level comprising: a plurality of first carry-save adders that are configured to compress the plurality of partial products except for a last partial product to generate first outputs; a second compressor level comprising a plurality of second carry-save adders that are configured to compress the first outputs to generate second outputs; a third compressor level comprising a plurality of third carry-save adders that are configured to compress the second outputs to generate third outputs; and a fourth compressor level comprising a fourth carry-save adder that is configured to compress the third outputs to generate a fourth carry output and a fourth sum output, wherein each one of the plurality of first carry-save adders are arranged in an interleaved manner among the booth multiplexers, wherein each one of the plurality of second carry-save adders are arranged in an interleaved manner among the booth multiplexers, wherein each one of the plurality of third carry-save adders are arranged in an interleaved manner among the booth multiplexers, and wherein the fourth carry-save adder is arranged among the booth multiplexers.
 14. A processor according to claim 13, wherein each one of the first carry-save adders, the second carry-save adders, the third carry-save adders, and the fourth carry-save adder comprises: a four-to-two carry-save adder that receives four partial products and reduces the four partial products to two intermediate partial products.
 15. A processor according to claim 13, wherein each one of the plurality of first carry-save adders are arranged between two pairs of the booth multiplexers and coupled to the two pairs of the booth multiplexers to receive the partial products from the two pairs of the booth multiplexers that that particular first carry-save adder is coupled to.
 16. A processor according to claim 13, wherein each one of the plurality of second carry-save adders are arranged between two pairs of the booth multiplexers, wherein each one of the plurality of second carry-save adders are arranged between and coupled to two of the first carry-save adders such that that particular second carry-save adder is configured to receive first outputs from the two of the first carry-save adders that it is coupled to.
 17. A processor according to claim 16, wherein each one of the plurality of third carry-save adders are arranged between two pairs of the booth multiplexers, wherein each one of the plurality of third carry-save adders are arranged between and coupled to two of the second carry-save adders such that that particular third carry-save adder is configured to receive second outputs from the two of the second carry-save adders that it is coupled to.
 18. A processor according to claim 17, wherein the fourth carry-save adder is arranged between two pairs of the booth multiplexers, wherein the fourth carry-save adder is arranged between and coupled to the third carry-save adders such that the fourth carry-save adder is configured to receive the third outputs from the third carry-save adders that it is coupled to.
 19. A processor according to claim 11, wherein the arithmetic processor is configured as a fused multiply-and-accumulate processor that is configured to perform a multiply-and-accumulate operation on the operands, and wherein the partial product reducer further comprises: a final compressor level coupled to the carry-save adder array, comprising: a final carry-save adder arranged in an interleaved manner between one pair of the booth multiplexers and a final booth multiplexer.
 20. A processor according to claim 13, wherein the arithmetic processor is configured as a fused multiply-and-accumulate processor that is configured to perform a multiply-and-accumulate operation on the operands, and wherein the partial product reducer further comprises: a fifth compressor level coupled to the carry-save adder array, comprising: a fifth carry-save adder, coupled to the fourth carry-save adder of the fourth compressor level, and being configured to generate result outputs based on a partial product, an aligned addend, the fourth carry output and the fourth sum output, wherein the result outputs comprise a carry result output and a sum result output.
 21. A processor according to claim 11, wherein each of the plurality of booth encoders is configured to receive a triplet of bits of the multiplier operand that serve as control signals, and perform booth encoding on that triplet of bits to generate the particular select signal, and wherein each particular select signal is used to instruct a corresponding booth multiplexer to select one of: (1) a 1× multiple of the bits of the multiplicand operand, (2) a 2× multiple of the bits of the multiplicand operand, (3) a complement of the 1× multiple of the bits of the multiplicand operand, or (4) a complement of the 2× multiple of the bits of the multiplicand operand.
 22. A processor according to claim 11, wherein each of the booth multiplexers is configured to receive inputs comprising: the particular version of the multiplicand operand, and one of the particular select signals generated by one of the booth encoders, and wherein each of the booth multiplexers operate in parallel on their respective inputs to generate a partial product such that the plurality of booth multiplexers collectively generate the plurality of partial products.
 23. A processor according to claim 11, wherein the plurality of compressor levels comprise: n compressor levels including a first through nth compressor levels, wherein each of the n compressor levels comprise at least one carry-save adder, and wherein each carry-save adder of a nth compressor level is interleaved between carry-save adders of a (n−1)th compressor level, and wherein each of the carry-save adders are arranged in an interleaved manner between at least two booth multiplexers.